Statistical Filtering and Subcategorization Frame Acquisition

Authors

  • Anna Korhonen
  • Genevieve Gorrell
  • Diana McCarthy
Abstract

Research into the automatic acquisition of subcategorization frames (SCFs) from corpora is starting to produce large-scale computational lexicons which include valuable frequency information. However, the accuracy of the resulting lexicons shows room for improvement. One significant source of error lies in the statistical filtering used by some researchers to remove noise from automatically acquired subcategorization frames. In this paper, we compare three different approaches to filtering out spurious hypotheses. Two hypothesis tests perform poorly compared to filtering frames on the basis of relative frequency. We discuss reasons for this and consider directions for future research.

1 Introduction

Subcategorization information is vital for successful parsing; however, manual development of large subcategorized lexicons has proved difficult because predicates change behaviour between sublanguages, domains and over time. Additionally, manually developed subcategorization lexicons do not provide the relative frequency of different SCFs for a given predicate, which is essential in a probabilistic approach. Over the past years, acquiring subcategorization dictionaries from textual corpora has become increasingly popular. The different approaches (e.g. Brent, 1991, 1993; Ushioda et al., 1993; Briscoe and Carroll, 1997; Manning, 1993; Carroll and Rooth, 1998; Gahl, 1998; Lapata, 1999; Sarkar and Zeman, 2000) vary largely according to the methods used and the number of SCFs being extracted. Regardless of this, there is a ceiling on the performance of these systems at around 80% token recall, where token recall is the percentage of SCF tokens in a sample of manually analysed text that were correctly acquired by the system.

The approaches to extracting SCF information from corpora have frequently employed statistical methods for filtering (e.g. Brent, 1993; Manning, 1993; Briscoe and Carroll, 1997; Lapata, 1999). This has been done to remove the noise that arises when dealing with naturally occurring data and from mistakes made by the SCF acquisition system, for example, parser errors. Filtering is usually done with a hypothesis test, and frequently with a variation of the binomial filter introduced by Brent (1991, 1993).

Hypothesis testing is performed by formulating a null hypothesis (H0), which is assumed true unless there is evidence to the contrary. If there is evidence to the contrary, H0 is rejected and the alternative hypothesis (H1) is accepted. In SCF acquisition, H0 is that there is no association between a particular verb (verb_j) and an SCF (scf_i), while H1 is that there is such an association. The test is one-tailed, since H1 states the direction of the association: a positive correlation between verb_j and scf_i. We compare the expected probability of scf_i occurring with verb_j if H0 is true to the observed probability of co-occurrence obtained from the corpus data. If the observed probability is greater than the expected probability, we reject H0 and accept H1; if not, we retain H0.

Despite the popularity of this method, it has been reported as problematic. According to one account (Briscoe and Carroll, 1997), the majority of errors arise because of the statistical filtering process, which is reported to be particularly unreliable for low-frequency SCFs (Brent, 1993; Briscoe and Carroll, 1997; Manning, 1993; Manning and Schütze, 1999). Lapata (1999) reported that a threshold on the relative frequencies produced slightly better results than those achieved with a Brent-style binomial filter.
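To make the contrast between the two filtering strategies concrete, here is a minimal sketch in Python of a Brent-style binomial filter alongside a simple relative-frequency threshold. The function names and all parameter values (the noise estimate p_err, the significance level alpha, the relative-frequency cutoff) are illustrative assumptions, not values taken from the paper.

    # Minimal sketch of the two filtering strategies discussed above.
    # p_err, alpha and cutoff are illustrative assumptions.
    from math import comb

    def binomial_filter(m: int, n: int, p_err: float, alpha: float = 0.05) -> bool:
        """Brent-style binomial hypothesis test.

        m:     number of times scf_i was hypothesised for verb_j
        n:     total number of occurrences of verb_j in the corpus
        p_err: probability under H0 that scf_i is hypothesised for a
               verb that does not in fact take it (a noise estimate)
        Returns True when H0 is rejected, i.e. the frame is accepted.
        """
        # One-tailed test: probability of observing m or more cues for
        # scf_i among n occurrences of verb_j if H0 (no association) holds.
        p_value = sum(comb(n, k) * p_err ** k * (1 - p_err) ** (n - k)
                      for k in range(m, n + 1))
        return p_value < alpha

    def relative_frequency_filter(m: int, n: int, cutoff: float = 0.01) -> bool:
        """Accept scf_i for verb_j if its relative frequency reaches a cutoff."""
        return m / n >= cutoff

    # Example: 5 cues for a frame among 100 occurrences of a verb,
    # with an assumed noise rate of 1%.
    print(binomial_filter(5, 100, p_err=0.01))            # True: H0 rejected
    print(relative_frequency_filter(5, 100, cutoff=0.01)) # True: 0.05 >= 0.01

The sketch shows why low-frequency frames are a weak point for the hypothesis test: with small n, even genuine frames rarely accumulate enough cues for the one-tailed p-value to fall below alpha.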


Similar Resources

Subcategorization acquisition

Manual development of large subcategorised lexicons has proved difficult because predicates change behaviour between sublanguages, domains and over time. Yet access to a comprehensive subcategorization lexicon is vital for successful parsing capable of recovering predicate-argument relations, and probabilistic parsers would greatly benefit from accurate information concerning the relative likel...


Automatic Acquisition of a Large Subcategorization Dictionary from Corpora

This paper presents a new method for producing a dictionary of subcategorization frames from unlabelled text corpora. It is shown that statistical filtering of the results of a finite state parser running on the output of a stochastic tagger produces high quality results, despite the error rates of the tagger and the parser. Further, it is argued that this method can be used to learn all subcat...


The Automatic Acquisition Of Frequencies Of Verb Subcategorization Frames From Tagged Corpora

We describe a mechanism for automatically acquiring verb subcategorization frames and their frequencies in a large corpus. A tagged corpus is first partially parsed to identify noun phrases and then a linear grammar is used to estimate the appropriate subcategorization frame for each verb token in the corpus. In an experiment involving the identification of six fixed subcategorization frames, o...


Using Semantically Motivated Estimates to Help Subcategorization Acquisition

Research into the automatic acquisition of subcategorization frames from corpora is starting to produce large-scale computational lexicons which include valuable frequency information. However, the accuracy of the resulting lexicons shows room for improvement. One source of error lies in the lack of accurate back-off estimates for subcategorization frames, delimiting the performance of statisti...


Lexical Knowledge Acquisition from Corpora

The paper presents a computational environment to support developing a lexicon for natural language processing. The underlying idea of the environment is to utilize up-to-date language technologies to minimize both the human labor and the inconsistency that are unavoidable in manual compilation of a lexicon. The proposed computational environment enables an efficient construction of a consistent ...




Publication year: 2000